Abstract

Cross-language information retrieval (CLIR), where queries and documents are in different languages,
has recently become one of the major topics within the information retrieval community.
This paper proposes a Japanese/English CLIR system that combines query translation
and retrieval modules. We currently target the retrieval of technical documents, and therefore
the performance of our system is highly dependent on the quality of the translation of technical
terms. However, technical term translation is still problematic in that technical terms
are often compound words, and thus new terms are progressively created by combining existing
base words. In addition, Japanese often represents loanwords using its special phonograms.
Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the
first problem, we produce a Japanese/English dictionary for base words, and translate compound
words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity.
For the second problem, we use a transliteration method, which maps words unlisted
in the base word dictionary to their phonetic equivalents in the target language. We evaluate
our system using a test collection for CLIR, and show that both the compound word translation
and transliteration methods improve the system performance.



1. Introduction 

Cross-language information retrieval (CLIR) is the retrieval process where the user presents queries in one
language to retrieve documents in another language. One of the earliest research references for CLIR dates
back to the 1960s (Mongar, 1969). In the 1970s, Salton (1970; 1972) empirically showed that CLIR using a
hand-crafted bilingual thesaurus is comparable with monolingual information retrieval in performance. The
1990s witnessed a growing number of machine readable texts in various languages, including those accessible
via the World Wide Web, but each piece of content is usually provided in a limited number of languages. Thus,
it is plausible that users are interested in retrieving information across languages. Possible users of CLIR
are given below:

• users who are able to read documents in foreign languages, but have difficulty formulating foreign 
queries, 

• users who find it difficult to retrieve/read relevant documents, but need the information; for such
users, applying machine translation (MT) systems to the limited number of documents retrieved
through CLIR is computationally more efficient than translating the entire collection,

• users who know foreign keywords/phrases, and want to read documents associated with them, in their 
native language. 

In fact, CLIR has recently become one of the major topics within the information retrieval (IR), natural
language processing (NLP) and artificial intelligence (AI) communities, and numerous CLIR systems have
been proposed (AAAI, 1997; ACM, 1996-1998; NIST, 1992-1998). Note that CLIR can be seen as
a subtask of multilingual information retrieval (MLIR), which also includes the following cases:

• identify the query language (based on, for example, character codes), and search a multilingual
collection for documents in the query language,

• retrieve documents, in which each document is in more than one language,

• retrieve documents using a query in more than one language (Fung et al., 1999).



However, these cases are beyond the scope of this paper. It should also be noted that while CLIR
is not necessarily limited to IR within two languages, we consistently use the term "bilingual," keeping the 
potential applicability of CLIR to more than two languages in mind, because the variety of languages used 
is not the central issue of this paper. 

Since by definition queries and documents are in different languages, CLIR needs a translation process 
along with the conventional monolingual retrieval process. For this purpose, existing CLIR systems adopt 
various techniques explored in NLP research. In brief, dictionaries, corpora, thesauri and MT systems are 
used to translate queries and/or documents. However, due to the rudimentary nature of existing translation 
methods, CLIR still finds it difficult to achieve the performance of monolingual IR. Roughly speaking, 
recent experiments showed that the average precision of CLIR is 50-75% of that obtained with monolingual
IR (Schäuble and Sheridan, 1997), which stimulates us to further explore this exciting research area.

In this paper, we propose a Japanese/English bidirectional CLIR system targeting technical documents,
which have been less explored than newspaper articles in past CLIR literature. Our research is
partly motivated by the NACSIS test collection for (CL)IR systems, which consists of Japanese queries and
Japanese/English abstracts collected from technical papers (Kando et al., 1999). We will elaborate on
the NACSIS collection in Section 5.1. As can be predicted, the performance of our CLIR system strongly
depends on the quality of the translation of technical terms, which are often unlisted in general dictionaries.



Pirkola (1998), for example, used a subset of the TREC collection related to health topics, and showed
that a combination of general and domain specific (i.e., medical) dictionaries improves the CLIR performance
obtained with only a general dictionary. This result shows the potential contribution of technical
term translation to CLIR. At the same time, it should be noted that even domain specific dictionaries do not
exhaustively list possible technical terms. For example, the EDR technical terminology dictionary (Japan
Electronic Dictionary Research Institute, 1995b), which consists of approximately 120,000 Japanese-English
translations related to the information processing field, lacks recent terms like "jouhou chuushutsu
(information extraction)." We classify problems associated with technical term translation as given below:

• technical terms are often compound words, which can be progressively created simply by combining 
multiple existing morphemes ("base words"), and therefore it is not entirely satisfactory or feasible to 
exhaustively enumerate newly emerging terms in dictionaries, 

• Japanese often represents loanwords (i.e., technical terms and proper nouns imported from foreign 
languages) using its special phonetic alphabet (or phonogram) called "katakana" with which new 
words can be spelled out, 

• English technical terms are often abbreviated, and the abbreviations can be used as "Japanese" words.

To counter the first problem, we propose a compound word translation method, which selects appropriate
translations based on the probability of occurrence of each combination of base words in the target language
(see Section 4.2). Note that technical compound words sometimes include general words, such as "AI chess"
and "digital watermark." In this paper, we do not rigorously define general words, by which we mean words
that are contained in existing general dictionaries but rarely in technical term dictionaries. For the second
problem, we propose a "transliteration" method, which identifies phonetic equivalents in the target language
(see Section 4.3). Finally, to resolve the third problem, we enhance our bilingual dictionary with pairs
of each abbreviation and its complete form (e.g., "IR" and "information retrieval") extracted from English
corpora (see Section 4.4). Note that although a number of methods targeting the above problems have
been explored in past research, no attempt has been made to integrate them in the context of CLIR.

Section 2 surveys past research on CLIR, and clarifies our focus and approach. Section 3 overviews our
CLIR system, and Section 4 elaborates on the translation method aimed at resolving the above problems
associated with technical term translation. Section 5 then evaluates the performance of our CLIR system
using the NACSIS collection.



http://www.rd.nacsis.ac.jp/~ntcadm/index-en.html

2. Past Research on CLIR

2.1. Retrieval Methodologies

Figure 1 classifies existing CLIR approaches in terms of retrieval methodology. The three top-level
categories correspond to the titles of the following items.



Query translation approach This approach translates queries into document languages using bilingual
dictionaries and/or corpora, prior to the retrieval process. Since the retrieval process is fundamentally the
same as that performed in monolingual IR, the translation module can easily be combined with existing IR
engines. This category can be further subdivided into the following three methods.



The first subcategory can be called dictionary-based methods. Hull and Grefenstette (1996) used a
bilingual dictionary to derive all possible translation candidates of query terms, which are used for the
subsequent retrieval. Their method is easy to implement, but potentially retrieves irrelevant documents and
decreases the time efficiency. To resolve this problem, Hull (1997) combined translation candidates for each
query term with the "OR" operator, and used the weighted boolean method to assign an importance degree
to each translation candidate.

Pirkola (1998) also used structured queries, where each term is combined with different types of operators.
Ballesteros and Croft (1997) enhanced the dictionary-based translation using the "local context analysis" (Xu
and Croft, 1996) and phrase-based translation. Dorr and Oard (1998) evaluated the effectiveness of a
semantic structure of a query in the query translation. As far as their comparative experiments were
concerned, the use of semantic structures was not as effective as MT/dictionary-based query translation
methods.

The second subcategory, corpus-based methods, uses translations extracted from bilingual corpora for
the query translation (Carbonell et al., 1997). In this paper, "(bilingual) aligned corpora" generally refer
to a pair of two language corpora aligned to each other on a word, sentence, paragraph or document basis.
Given such resources, corpus-based methods are expected to acquire domain specific translations unlisted in
existing dictionaries. In fact, Carbonell et al. (1997) empirically showed that their corpus-based query
translation method outperformed a dictionary-based method. Their comparative evaluation also showed that the
corpus-based translation method outperformed GVSM/LSI-based methods (see the following "Interlingual
representation approach" item for details of GVSM and LSI). Note that for the purpose of corpus-based
translation methods, a number of translation extraction techniques explored in NLP research (Fung, 1995; Kaji
and Aizono, 1996; Smadja et al., 1996) are applicable.

Finally, hybrid methods use corpora to resolve the translation ambiguity inherent in bilingual dictionaries.
Unlike the corpus-based translation methods described above, which rely on bilingual corpora, Ballesteros
and Croft (1998) and Chen et al. (1999) independently used a monolingual corpus for the disambiguation,
and therefore the implementation cost is lower. In practice, their method selects the combination of translation
candidates that frequently co-occur in the target language corpus. On the other hand, bilingual corpora are
also applicable to hybrid methods. Okumura et al. (1998) and Yamabana et al. (1996) independently used
the same disambiguation method, in that they consider word frequencies in both the source and target
languages, obtained from a bilingual aligned corpus. Nie et al. (1999) automatically collected parallel texts
in French and English from the World Wide Web, to train a probabilistic query translation model, and
suggested its feasibility for CLIR.



Davis and Ogden (1997) used a bilingual aligned corpus as the document collection for training retrieval.
They first derive possible translation candidates using a dictionary. Then, training retrieval trials are
performed on the bilingual corpus, in which the source and translated queries are used to retrieve source and
target documents, respectively. Finally, they select translations which retrieved documents aligned to those
retrieved with the source query. Note that this method provides a salient contrast to other query translation
methods, in which translation is performed independently from the retrieval module.

Chen et al. (1999) addressed the disambiguation of polysemy in the target language, along with the
translation disambiguation. Specifically, in the case where a source query term corresponds to a small number
of translations, but some of these translations are associated with a large number of word senses, the
polysemous disambiguation is more crucial than the resolution of translation ambiguity. To counter this
problem, source query terms are expanded with words that frequently co-occur, which are expected to
restrict the meaning of polysemous words in the target language documents.



Document translation approach This approach translates documents into query languages, prior to
the retrieval. In most cases, existing MT systems are used to translate all the documents in a given
collection (Gachot et al., 1996; Kwon et al., 1998; Oard, 1998). Otherwise, a dictionary-based method is used to
translate only index terms (Aone et al., 1997). It is plausible that when compared with short queries,
documents contain a significantly higher volume of information for the translation. In fact, Oard (1998) showed
that the document translation method using an MT system outperformed several types of dictionary-based
query translation methods.

However, McCarley (1999) showed that the relative superiority between query and document translation
approaches varied depending on the source and target language pair. He also showed that a hybrid system
(not to be confused with the hybrid methods described in the "Query translation approach" item above), where
the relevance degree of each document (i.e., the "score") is the mean of those obtained with query and
document translation systems, outperformed systems based on either the query or document translation approach.
However, generally speaking, the full translation of large-scale collections can be prohibitive.
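
The score averaging described for the hybrid system above can be illustrated with a minimal sketch. The function name `combine_scores`, the dictionary-based score format, and the choice to treat a document missing from one run as scoring 0.0 are assumptions made for illustration, not details taken from McCarley (1999).

```python
def combine_scores(query_trans_scores, doc_trans_scores):
    """Average per-document scores from two runs (hypothetical helper).

    Each argument maps a document ID to a relevance score. A document
    absent from one run is assumed to score 0.0 in that run.
    """
    doc_ids = set(query_trans_scores) | set(doc_trans_scores)
    combined = {
        doc_id: (query_trans_scores.get(doc_id, 0.0)
                 + doc_trans_scores.get(doc_id, 0.0)) / 2.0
        for doc_id in doc_ids
    }
    # Rank by combined score, highest first; ties are broken by document ID.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Toy scores, invented purely for illustration.
ranking = combine_scores({"d1": 0.8, "d2": 0.4}, {"d1": 0.2, "d3": 0.6})
```

In practice the two runs would probably need score normalization before averaging, since scores from different systems are not necessarily on the same scale.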



Interlingual representation approach The basis of this approach is to project both queries and
documents into a language-independent (conceptual) space. In other words, as Salton (1970; 1972) and Sheridan
and Ballerini (1996) identified, the interlingual representation approach is based on query expansion methods
proposed for monolingual IR. This category can be subdivided into thesaurus-based methods and variants
of the vector space model (VSM) (Salton and McGill, 1983).

Salton (1970; 1972) applied hand-crafted English/French and English/German thesauri to the SMART
system (Salton, 1971), and demonstrated that a CLIR version of the SMART system is comparable to the
monolingual version in performance. The International Road Research Documentation scheme (Mongar,
1969) used a trilingual thesaurus associated with English, German and French. Gilarranz et al. (1997)
and Gonzalo et al. (1998) used the EuroWordNet multilingual thesaurus (Vossen, 1998). Unlike these
methods relying on manual thesaurus construction, Sheridan and Ballerini (1996) used a multilingual
thesaurus automatically produced from an aligned corpus.

The generalized vector space model (GVSM) (Wong et al., 1985) and latent semantic indexing (LSI)
(Deerwester et al., 1990), which were originally proposed as variants of the vector space model for monolingual IR,
project both queries and documents into a language-independent vector space, and therefore these methods
are applicable to CLIR. While Dumais et al. (1996) explored an LSI-based CLIR, Carbonell et al. (1997)
empirically showed that GVSM outperformed LSI in terms of CLIR. Note that like thesaurus-based methods,
GVSM/LSI-based methods require aligned corpora.





Figure 1: Classification of CLIR retrieval methods (the method we adopt is underlined)



2.2. Presentation Methodologies 



In the case of CLIR, retrieved documents are not always written in the user's native language. Therefore,
the presentation methodology of retrieval results is a more crucial task than in monolingual IR. It is desirable
to present smaller-sized contents with less noise; in other words, precision is often given more importance than
recall for CLIR systems. Note that effective presentation is also crucial when a user and system interactively
retrieve relevant documents, as performed in relevance feedback (Salton and McGill, 1983).

However, a surprisingly small number of references addressing this issue can be found in past research
literature. Aone et al. (1997) presented only keywords frequently appearing in retrieved documents, rather
than entire documents. Note that since most CLIR systems use frequency information associated with index
terms like "term frequency (TF)" and "inverse document frequency (IDF)" (Salton and McGill, 1983) for
the retrieval, frequently appearing keywords can be identified without an excessive additional computational
cost. Experiments independently conducted by Oard and Resnik (1999) and Suzuki et al. (1998) showed
that even a simple translation of keywords (such as using all possible translations defined in a dictionary)
improved the efficiency with which users find relevant foreign documents in the whole retrieval result.



Suzuki et al. (1999) more extensively investigated the user's retrieval efficiency (i.e., the time efficiency and
accuracy with which human subjects find relevant foreign documents) by comparing different presentation
methods, in which the following contents were independently presented to the user:

1. keywords without translation,

2. keywords translated with the first entry defined in a dictionary,

3. keywords translated through the hybrid method (see the "Query translation approach" item in
Section 2.1),

4. documents summarized (by existing summarization software) and manually translated.

Their comparative experiments showed that the third content was most effective in terms of the retrieval
efficiency.

For monolingual IR, automatic summarization methods based on the user's focus/query have recently
been explored. Mani and Bloedorn (1998) used machine learning techniques to produce document
summarization rules based on the user's focus (i.e., query). Tombros and Sanderson (1998) showed experimental
results, in which presenting the fragment of each retrieved document containing query terms improved the
retrieval efficiency of human subjects. The applicability of these methods to CLIR needs to be further explored.

2.3. Evaluation Methodologies 

From a scientific point of view, performance evaluation is invaluable for CLIR. In most cases, the
evaluation of CLIR is the same as that performed for monolingual IR. That is, each system conducts a retrieval
trial using a test collection consisting of predefined queries and documents in different languages, and then
the performance is evaluated based on the precision and recall. Several experiments used test collections
for monolingual IR in which either queries or documents were translated, prior to the evaluation.
However, as Sakai et al. (1999) empirically showed, the CLIR performance varies depending on the quality of
the translation of collections, and thus it is desirable to carefully produce test collections for CLIR. The
production of test collections usually involves collecting documents, producing queries and relevance
assessment for each query. However, since relevance assessment is expensive, especially for large-scale collections
(even in the case where the pooling method (Voorhees, 1998) is used to reduce the number of candidates of
relevant documents), Carbonell et al. (1997) first translated queries into the document language, and used
as (pseudo) relevant documents those retrieved with the translated queries. In other words, this evaluation
method investigates the extent to which CLIR maintains the performance of monolingual IR.
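
As a minimal sketch of the precision/recall-based evaluation described above, the following computes set-based precision and recall together with non-interpolated average precision for a single query. The function names and the toy document IDs are hypothetical, and non-interpolated average precision is only one common way of summarizing a ranked list; it is not claimed to be the exact measure used in any of the cited experiments.

```python
def precision_recall(ranked_ids, relevant_ids):
    """Precision and recall of a retrieved list against a relevant set."""
    retrieved = set(ranked_ids)
    hits = len(retrieved & set(relevant_ids))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision over one ranked list."""
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant) if relevant else 0.0


# Toy ranked list and relevance judgments, invented for illustration.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
p, r = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
```

Under the pseudo-relevance setting described above, `relevant` would be filled with the documents retrieved by the translated (monolingual) query instead of human judgments.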

For the evaluation of presentation methods, human subjects are often used to investigate the retrieval
efficiency, as described in Section 2.2. However, evaluation methods involving human interactions are
problematic, because human subjects are in a way trained through repetitive retrieval trials for different systems,
which can potentially bias the result. On the other hand, in the case where each subject uses a single system,
differences among subjects affect the result. To minimize this bias, multiple subjects are usually classified based
on, for example, their literacy in terms of the target language, and those falling into the same cluster are
virtually regarded as the same person. However, this issue still remains an open question, and needs to be
further explored.

2.4. Our Focus and Approach

Through discussions in the above three sections, we identified the following points which should be taken 
into consideration for our research. 

For translation methodology, the query translation approach is preferable in terms of implementation cost, 
because this approach can simply be combined with existing IR engines. On the other hand, other approaches 
can be prohibitive, because (a) the document translation approach conducts the full translation on the entire 
collection, and (b) the interlingual representation approach requires alignment of bilingual thesauri/corpora. 
In fact, we do not have Japanese-English thesauri/corpora with sufficient volume of alignment information 
at present. One may argue that the NACSIS collection, which is a large-scale Japanese-English aligned
corpus, can be used for the translation. However, note that bilingual corpora for the translation must
not be obtained from the test collection used for the evaluation, because in real world usage one of the two 
language documents in the collection is usually missing. In other words, CLIR has little necessity for bilingual 
aligned document collections, in that the user can retrieve documents in the query language, without the 
translation process. 

However, at the same time we concede that each approach is worth further exploration, and in this paper 
we do not pretend to draw any premature conclusions regarding the relative merits of different approaches. 
To sum up, we focus mainly on translating sequences of content words included in queries, rather than the 
entire collection. Among different methods following the query translation approach, we adopt the hybrid 
method using a monolingual corpus. In other words, our translation method is relatively similar to that
proposed by Ballesteros and Croft (1998) and Chen et al. (1999). However, unlike their cases, we
integrate word-based translation and transliteration methods within the query translation.

For presentation methodology, we use keywords translated using the hybrid translation method, which
were proven to be effective in comparative experiments by Suzuki et al. (1999) (in the case where retrieved
documents are not in the user's native language). Note that for the purpose of the translation of keywords, we
can use exactly the same method as performed for the query translation, because both queries and keywords
usually consist of one or more content words.



Finally, for the evaluation of our CLIR system we use the NACSIS collection (Kando et al., 1999).
Since in this collection relevance assessment is performed between Japanese queries and Japanese/English
documents, we can easily evaluate our system in terms of Japanese-English CLIR. On the other hand, the
evaluation of English-Japanese CLIR is beyond the scope of this paper, because as discussed in Section 2.3
the production of English queries has to be carefully conducted, and is thus expensive. Besides this, in
this paper we do not evaluate our system in terms of presentation methodology, because experiments using
human subjects are also expensive and still problematic. These remaining issues need to be further explored.

3. System Overview 

Figure 2 depicts the overall design of our CLIR system, in which we combine a translator with an IR
engine for monolingual retrieval. In the following, we briefly explain the retrieval process based on this figure.

First, the translator processes a query in the source language (query in S) to output the translation
(query in T). For this purpose, the translator uses a dictionary to derive possible translation candidates and
a collocation to resolve the translation ambiguity. Note that a user can utilize more than one translation
candidate, because multiple translations are often appropriate for a single query. By the collocation, we
mean bi-gram statistics associated with content words extracted from NACSIS documents. Since our system
is bidirectional between Japanese and English, we tokenize documents with different methods, depending
on their language. For English documents, the tokenization involves eliminating stopwords and identifying
root forms for inflected content words. For this purpose, we use WordNet (Fellbaum, 1998), which contains
a stopword list and correspondences between inflected words and their root form. On the other hand, we
segment Japanese documents into lexical units using the ChaSen morphological analyzer (Matsumoto et al.,
1997), which has commonly been used for much Japanese NLP research, and extract content words based
on their part-of-speech information.
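
The bi-gram statistics mentioned above can be sketched as follows. This is a minimal illustration that assumes the documents have already been tokenized into content words (neither WordNet nor ChaSen is invoked here), and that a bi-gram is simply a pair of adjacent content words within one document; both the function name `bigram_counts` and the toy documents are hypothetical.

```python
from collections import Counter


def bigram_counts(tokenized_docs):
    """Count adjacent content-word pairs across tokenized documents."""
    counts = Counter()
    for tokens in tokenized_docs:
        # zip pairs each token with its right-hand neighbour, so pairs
        # never span a document boundary.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Toy content-word sequences, invented for illustration.
docs = [
    ["information", "retrieval", "system"],
    ["cross", "language", "information", "retrieval"],
]
counts = bigram_counts(docs)
```

Counts of this kind can later be normalized into conditional probabilities when the collocation is used to rank translation candidates.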

Second, the IR engine searches the NACSIS collection for documents (docs in T) relevant to the translated
query, and sorts them according to the degree of relevance, in descending order. Our IR engine is currently
a simple implementation of the vector space model, in which the similarity between the query and each
document (i.e., the degree of relevance of each document) is computed as the cosine of the angle between
their associated vectors. We used the notion of TF-IDF for term weighting. Among a number of variations
of term weighting methods (Salton and Buckley, 1988; Zobel and Moffat, 1998), we tentatively implemented
two alternative types of TF (term frequency) and one type of IDF (inverse document frequency), as shown
in Equation (1).

$$
\begin{aligned}
TF &= f_{t,d} && \text{(standard formulation)} \\
TF &= 1 + \log(f_{t,d}) && \text{(logarithmic formulation)} \\
IDF &= \log\left(\frac{N}{n_t}\right)
\end{aligned}
\tag{1}
$$

Here, $f_{t,d}$ denotes the frequency with which term $t$ appears in document $d$, and $n_t$ denotes the number
of documents containing term $t$. $N$ is the total number of documents in the collection. The second TF type
diminishes the effect of $f_{t,d}$, and consequently IDF affects the similarity computation more. We shall call
the first and second TF types "standard" and "logarithmic" formulations, respectively. For the indexing process, we
first tokenize documents as explained above (i.e., we use WordNet and ChaSen for English and Japanese
documents, respectively), and then conduct the word-based indexing. That is, we use each content word
as a single indexing term. Since our focus in this paper is the query translation rather than the retrieval
process, we do not explore other IR techniques, including query expansion and relevance feedback.
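
A minimal sketch of the retrieval step described above is given below, using the two TF formulations and the IDF of Equation (1) with cosine similarity. It is an illustration under stated assumptions, not the actual engine: the function names are hypothetical, the query is weighted with the same TF-IDF scheme as the documents, and query terms that do not occur in the collection are ignored.

```python
import math
from collections import Counter


def tf_weight(freq, logarithmic=False):
    """TF as in Equation (1): f_{t,d}, or 1 + log(f_{t,d})."""
    return 1.0 + math.log(freq) if logarithmic else float(freq)


def build_vectors(docs, logarithmic=False):
    """Return (one sparse TF-IDF vector per document, IDF table)."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    idf = {term: math.log(n_docs / n_t) for term, n_t in doc_freq.items()}
    vectors = [
        {t: tf_weight(f, logarithmic) * idf[t]
         for t, f in Counter(tokens).items()}
        for tokens in docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_tokens, docs, logarithmic=False):
    """Return (score, document index) pairs, most similar first."""
    vectors, idf = build_vectors(docs, logarithmic)
    query_vec = {
        t: tf_weight(f, logarithmic) * idf[t]
        for t, f in Counter(query_tokens).items()
        if t in idf  # assumption: terms unseen in the collection are ignored
    }
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Toy collection of content-word lists, invented for illustration.
docs = [
    ["data", "mining", "method"],
    ["data", "base", "system"],
    ["speech", "recognition", "system"],
]
ranking = rank(["data", "mining"], docs)
```

Passing `logarithmic=True` switches to the second TF formulation; on a collection where terms repeat within documents, this damps the contribution of raw frequency so that IDF has more influence on the ranking.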

Finally, in the case where retrieved documents are not in the user's native language, we extract keywords
from retrieved documents, and translate them into the source language using the translator (KWs in S).
Unlike existing presentation methods, where keywords are words frequently appearing in each document (Aone
et al., 1997; Suzuki et al., 1998; Suzuki et al., 1999), we tentatively use author keywords. In the NACSIS
collection, each document contains roughly 3-5 single/compound keywords provided by the author(s) of the
document. In addition, since the NACSIS documents are relatively short abstracts (instead of entire papers),
it is not entirely satisfactory to rely on the word frequency information. Note that even in the case where
retrieved documents are in the user's native language, presenting author keywords is expected to improve
the retrieval efficiency.

For future enhancement, we optionally use an MT system to translate entire documents retrieved (or
only documents identified as relevant using author keywords) into the user's native language (docs in S).
We currently use the Transer Japanese/English MT system, which combines a general dictionary consisting
of 230,000 entries, and a computer terminology dictionary consisting of 100,000 entries.³ Note that the
translation of the limited number of retrieved documents is less expensive than that of the whole collection,
as performed in the document translation approach (see Section 2.1).

In Section 4, we will explain the translator in Figure 2, which involves compound word translation
and transliteration methods. While our translation method is applicable to both queries and keywords in
documents, in the following we shall call it the query translation method without loss of generality.



Figure 2: The overall design of our CLIR system (S and T denote the source and target languages,
respectively)

4. Query Translation Method 

4.1. Overview 

Given a query in the source language, tokenization is first performed as for target documents, that is, we
use WordNet and ChaSen for English and Japanese queries, respectively (see Section 3). We then discard
stopwords and extract only content words. Here, "content words" refer to both single and compound words. 
Let us take the following English query as an example: 

improvement or proposal of data mining methods. 

For this query, we discard "or" and "of," to extract "improvement," "proposal" and "data mining methods."
Thereafter, we translate each extracted content word on a word-by-word basis, maintaining the word order
in the source language. A preliminary study showed that approximately 95% of compound technical terms
defined in a bilingual dictionary (Ferber, 1989) maintain the same word order in both Japanese and English.
Note that we currently do not consider relations (e.g., syntactic relations) between content words, and thus
each content word is translated independently. In brief, our translation method consists of the following two
phases:

(1) derive all possible translations for base words,

(2) resolve translation ambiguity using the collocation associated with base word translations.

3 Developed by NOVA, Inc.

While phase (2) is the same for both Japanese-English and English-Japanese translations, phase (1) differs 
depending on the source language. In the case of English-Japanese translation, we simply consult our 
bilingual dictionary for each base word. However, transliteration is performed whenever base words unlisted 
in the dictionary are found. 
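As an illustration of the content-word extraction described at the start of this section, the following minimal Python sketch splits a query at stopwords and treats each remaining run of words as one (single or compound) content word. The stopword list and the function name `extract_content_words` are assumptions made for illustration only; the tokenization with WordNet and ChaSen is not reproduced here.

```python
import re

# Hypothetical stopword list; the actual list used by the system is not given in the text.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to"}


def extract_content_words(query):
    """Split a query at stopwords; each maximal run of other words is one content word."""
    tokens = re.findall(r"[A-Za-z0-9-]+", query.lower())
    content_words, current = [], []
    for token in tokens:
        if token in STOPWORDS:
            # A stopword closes the content word collected so far.
            if current:
                content_words.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        content_words.append(" ".join(current))
    return content_words


example = extract_content_words("improvement or proposal of data mining methods.")
```

For the example query, `example` holds "improvement," "proposal" and "data mining methods," which matches the extraction described above.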

On the other hand, in the case of Japanese-English translation, we consider all possible segmentations of
the input word, by consulting the dictionary, because Japanese compound words lack lexical segmentation.⁴
Then, we select the segmentations that consist of the minimal number of base words. This segmentation
method parallels that for the Japanese compound noun analysis (Kobayashi et al., 1994). During the
segmentation process, the dictionary derives all possible translations for base words. At the same time,
transliteration is performed only when katakana words unlisted in the base word dictionary are found.
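The minimal-segmentation step can be sketched with a simple dynamic programming procedure. This is only an illustrative sketch: the function name `minimal_segmentations`, the use of romanized strings in place of Japanese script, and the toy dictionary are assumptions, and the actual implementation may differ.

```python
def minimal_segmentations(word, dictionary):
    """Return every segmentation of `word` into dictionary entries that uses the fewest entries."""
    n = len(word)
    # best[i] is None if word[:i] cannot be segmented, otherwise a pair
    # (minimal number of base words, list of segmentations achieving it).
    best = [None] * (n + 1)
    best[0] = (0, [[]])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is None or piece not in dictionary:
                continue
            count = best[start][0] + 1
            extended = [seg + [piece] for seg in best[start][1]]
            if best[end] is None or count < best[end][0]:
                best[end] = (count, extended)
            elif count == best[end][0]:
                best[end][1].extend(extended)
    return best[n][1] if best[n] else []


# Hypothetical romanized base words (jouhou = information, kensaku = retrieval).
toy_dictionary = {"jou", "hou", "jouhou", "ken", "saku", "kensaku"}
result = minimal_segmentations("jouhoukensaku", toy_dictionary)
```

Here the two-word segmentation "jouhou" + "kensaku" is kept, while segmentations with three or four base words are discarded.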

4.2. Compound Word Translation 

This section explains our compound word translation method based on a probabilistic model, focusing 
mainly on the resolution of translation ambiguity. After deriving possible translations for base words (by 
way of either consulting the base word dictionary or performing transliteration), we can formally represent 
the source compound word S and one translation candidate T as below. 

$$S = s_1, s_2, \ldots, s_n$$
$$T = t_1, t_2, \ldots, t_n$$

Here, $s_i$ denotes the $i$-th base word, and $t_i$ denotes a translation candidate of $s_i$. Our task, i.e., to select
the $T$ which maximizes $P(T|S)$, is transformed into Equation (2) through use of Bayes' theorem, as
performed in statistical machine translation (Brown et al., 1993).

$$\arg\max_T P(T|S) = \arg\max_T P(S|T) \cdot P(T) \qquad (2)$$



4 For Japanese query terms used in our evaluation (see Section 5), the average number of possible segmentations was 4.9.

In practice, in the case where the user utilizes more than one translation, T's with greater probabilities are
selected. We approximate $P(S|T)$ and $P(T)$ using statistics associated with base words, as in Equation (3).

$$P(S|T) \approx \prod_{i=1}^{n} P(s_i|t_i)$$
$$P(T) \approx \prod_{i=1}^{n-1} P(t_{i+1}|t_i) \qquad (3)$$



One may notice that this approximation is analogous to that for statistical part-of-speech tagging, where
$s_i$ and $t_i$ in Equation (3) correspond to a word and one of its part-of-speech candidates, respectively (Church
and Mercer, 1993). Here, we estimate $P(t_{i+1}|t_i)$ using the word-based bi-gram statistics extracted from target
language documents (i.e., the collocation in Figure 2). Before elaborating on the estimation of $P(s_i|t_i)$ we
explain the way to produce our bilingual dictionary for base words, because $P(s_i|t_i)$ is estimated using this
dictionary.
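To make Equations (2) and (3) concrete, the following Python sketch scores every combination of base word translations by the product of the translation probabilities and the target-language bi-gram probabilities, and returns the k best candidates. The function name `rank_translations`, the table layout and all numerical values are hypothetical; only the scoring formula follows the equations. Exhaustive enumeration is used for brevity, and a Viterbi-style search would be a natural alternative for long compound words.

```python
from itertools import product


def rank_translations(source_words, candidates, p_s_given_t, p_bigram, k=1):
    """Rank translations T of S by P(S|T) * P(T), approximated as in Equation (3)."""
    scored = []
    for target_words in product(*(candidates[s] for s in source_words)):
        score = 1.0
        # P(S|T): product of P(s_i | t_i) over aligned base words.
        for s, t in zip(source_words, target_words):
            score *= p_s_given_t.get((s, t), 0.0)
        # P(T): product of bi-gram probabilities P(t_{i+1} | t_i).
        for previous, following in zip(target_words, target_words[1:]):
            score *= p_bigram.get((previous, following), 0.0)
        scored.append((score, target_words))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Illustrative values only; they are not taken from the dictionaries or the collection.
candidates = {"soukan": ["associative", "correlation"], "kansuu": ["function"]}
p_s_given_t = {
    ("soukan", "associative"): 1 / 3,
    ("soukan", "correlation"): 1.0,
    ("kansuu", "function"): 1.0,
}
p_bigram = {("associative", "function"): 0.001, ("correlation", "function"): 0.01}
best = rank_translations(["soukan", "kansuu"], candidates, p_s_given_t, p_bigram, k=1)
```

With these toy values, "correlation function" is preferred over "associative function" because both the translation probability and the bi-gram probability favor it.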



For our dictionary production, we used the EDR technical terminology dictionary (Japan Electronic
Dictionary Research Institute, 1995b), which includes approximately 120,000 Japanese-English translations
related to the information processing field. Since most of the entries are compound words, we need to
segment Japanese compound words, and correlate Japanese-English translations on a word-by-word basis.
However, the complexity of segmenting Japanese words becomes much greater as the number of component
base words increases. In consideration of these factors, we first extracted 59,533 English words consisting
of only two base words, and their Japanese translations. We then developed simple heuristics to segment
Japanese compound words into two substrings. Our heuristics rely mainly on Japanese character types, i.e.,
"kanji," "katakana," "hiragana," alphabets and other characters like numerals. Note that kanji (or Chinese
characters) are the Japanese ideograms, and katakana and hiragana are phonograms.

In brief, we segment each Japanese word at the boundary of different character types (or at the leftmost 
boundary for words containing more than one character type boundary). Although this method is relatively 
simple, a preliminary study showed that we can almost correctly segment words that are in one of the 
following forms: "CK," "CA," "AK" and "KA." Here, "C," "K" and "A" denote kanji, katakana and alphabet 
character sequences, respectively. For other combinations of character types, we identified one or more cases
in which our segmentation method performed incorrectly.

On the other hand, in the case where a given Japanese word consists of a single character type, we
segment the word at the middle (or at the left-side of the middle character for words consisting of an odd
number of characters). Note that roughly 90% of Japanese words consisting of four kanji characters can be
correctly segmented at the middle (Kobayashi et al., 1994). However, in the case where resultant substrings
begin/end with characters that do not appear at the beginning/end of words (for example, Japanese words
rarely begin with a long vowel), we shift the segmentation position to the right.



Tsuji and Kageura (1997) used a hidden Markov model (HMM) to segment Japanese compound words in an
English-Japanese bilingual dictionary. Their method can also segment words consisting of more than two base
words, and reportedly achieved an accuracy of roughly 80-90%, whereas our segmentation method is applicable
only to those consisting of two base words. However, while the HMM-based segmentation is expected to improve
the quality of our dictionary production, in this paper we tentatively show, by way of experiments, that our
heuristics-based method is effective for CLIR despite its simple implementation (see Section 5).

As a result, we obtained 24,439 Japanese and 7,910 English base words. We randomly sampled 600
compound words, and confirmed that 95% of those words were correctly segmented. Figure 3 shows a fragment
of the EDR dictionary (after segmenting Japanese words), and Figure 4 shows a base word dictionary
produced from entries in Figure 3. Figure 4 contains Japanese variants, such as memori/memorii for the
English word "memory." We can easily produce a Japanese-English base word dictionary from Figure 3,
using the same procedure.

During the dictionary production, we also count the correspondence frequency for each combination of $s_i$
and $t_i$, in order to estimate $P(s_i|t_i)$. In Figure 3, for example, the Japanese base word "soukan" corresponds
once to "associative," and twice to "correlation." Thus, we can derive Equation (4).

$$P(\mathrm{associative} \mid \mathrm{soukan}) = 1/3$$
$$P(\mathrm{correlation} \mid \mathrm{soukan}) = 2/3 \qquad (4)$$

However, in the case where $s_i$ is transliterated into $t_i$, we replace $P(s_i|t_i)$ with a probabilistic score computed
by our transliteration method (see Section 4.3).
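The estimation from correspondence frequencies can be sketched in a few lines of Python. The entries below are those of Figure 3; the function name `correspondence_probabilities` is hypothetical, and normalizing the counts per Japanese base word is our reading of the "soukan" example rather than a documented implementation detail.

```python
from collections import Counter
from fractions import Fraction

# Segmented entries from Figure 3 (English, romanized Japanese).
FIGURE_3_ENTRIES = [
    ("CCD memory", "CCD memorii"),
    ("IC memory", "IC memori"),
    ("associative learning", "soukan gakushuu"),
    ("associative memory", "rensou memori"),
    ("associative record", "ketsugou rekoodo"),
    ("correlation function", "soukan kansuu"),
    ("error detection", "ayamari kenshutsu"),
    ("factor correlation", "inshi soukan"),
    ("hybrid IC", "haiburiddo shuusekikairo"),
]


def correspondence_probabilities(entries):
    """Estimate P(english | japanese) from word-by-word correspondence counts."""
    pair_counts = Counter()
    japanese_counts = Counter()
    for english, japanese in entries:
        # Each entry has two base words per language, aligned in the same order.
        for e, j in zip(english.split(), japanese.split()):
            pair_counts[(e, j)] += 1
            japanese_counts[j] += 1
    return {
        (e, j): Fraction(count, japanese_counts[j])
        for (e, j), count in pair_counts.items()
    }


probabilities = correspondence_probabilities(FIGURE_3_ENTRIES)
```

Applied to Figure 3, this yields 1/3 for ("associative", "soukan") and 2/3 for ("correlation", "soukan"), as in the example above.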

One may argue that $P(s_i|t_i)$ should be estimated based on real world usage, i.e., bilingual corpora.
However, such resources are generally expensive to obtain, and we do not have Japanese-English corpora
with sufficient volume of alignment information at present (see Section 2.4 for more discussion).



English                 Japanese
CCD memory              CCD memorii
IC memory               IC memori
associative learning    soukan gakushuu
associative memory      rensou memori
associative record      ketsugou rekoodo
correlation function    soukan kansuu
error detection         ayamari kenshutsu
factor correlation      inshi soukan
hybrid IC               haiburiddo shuusekikairo

Figure 3: A fragment of the EDR technical terminology dictionary

4.3. Transliteration 

This section explains our transliteration method, which identifies phonetic equivalent translations for 
words unlisted in the base word dictionary. 

Figure 5 shows example correspondences between English and (romanized) katakana words, where we
insert hyphens between each katakana character for enhanced readability. The basis of our transliteration
method is analogous to that for compound word translation described in Section 4.2. The source word S
and one transliteration candidate T are represented as below.

$$S = s_1, s_2, \ldots, s_n$$
$$T = t_1, t_2, \ldots, t_n$$

Here, unlike the case of compound word translation, $s_i$ and $t_i$ denote the $i$-th "symbols" (which consist of one
or more letters) of S and T, respectively. To derive possible $s_i$'s and $t_i$'s, we consider all possible segmentations of
the source word S, by consulting a dictionary for symbols, namely the "transliteration dictionary." Then,
we select the segmentations that consist of the minimal number of symbols. Note that unlike the case of
compound word translation, the segmentation is performed for both Japanese-English and English-Japanese
transliterations.



English        Japanese
CCD            CCD
IC             IC, shuusekikairo
associative    soukan, rensou, ketsugou
correlation    soukan
detection      kenshutsu
error          ayamari
factor         inshi
function       kansuu
hybrid         haiburiddo
learning       gakushuu
memory         memori, memorii
record         rekoodo

Figure 4: A fragment of an English-Japanese base word dictionary produced from Figure 3



English        Japanese
system         shi-su-te-mu
mining         ma-i-ni-n-gu
data           dee-ta
network        ne-tto-waa-ku
text           te-ki-su-to
collocation    ko-ro-ke-i-sho-n

Figure 5: Example correspondences between English and (romanized) Japanese katakana words



Thereafter, we resolve the transliteration ambiguity based on a probabilistic model similar to that for
the compound word translation. To put it more precisely, we compute $P(T|S)$ for each T using Equation (2),
and select T's with greater probabilities. Note that T's must be correct words (that are indexed in the
NACSIS document collection). However, Equation (3), which approximates $P(T)$ by combining $P(t_{i+1}|t_i)$'s for
substrings of T, potentially assigns positive probability values to incorrect (unindexed) words.

In view of this problem, we estimate $P(T)$ as the probability that T occurs in the document collection,
and consequently the probability for unindexed words becomes zero. In practice, during the segmentation
process we simply discard such T's that are unindexed in the document collection, so that we can speed up
the computation of the $P(T|S)$'s. On the other hand, we approximate $P(S|T)$ as in Equation (3), and estimate
$P(s_i|t_i)$ based on the correspondence frequency for each combination of $s_i$ and $t_i$ in the transliteration
dictionary.

The crucial issue here is the way to produce the transliteration dictionary, because such dictionaries
have rarely been published. For the purpose of dictionary production, we used approximately 35,000
katakana Japanese words and their English translations collected from the EDR technical terminology
dictionary (Japan Electronic Dictionary Research Institute, 1995b) and bilingual dictionary (Japan Electronic
Dictionary Research Institute, 1995a). To illustrate our dictionary production method, we consider Figure 5
again. Looking at this figure, one may notice that the first letter in each katakana character tends to be
contained in its corresponding English word. However, there are a few exceptions. A typical case is that
since Japanese has no distinction between "L" and "R" sounds, the two English sounds collapse into the
same Japanese sound. In addition, a single English letter may correspond to multiple katakana characters,
such as "x" to "ki-su" in "<text, te-ki-su-to>." To sum up, English and romanized katakana words are not
exactly identical, but similar to each other.

We first manually defined the similarity between the English letter e and the first romanized letter for
each katakana character j, as shown in Table 1. In this table, "phonetically similar" letters refer to a certain
pair of letters, such as "L" and "R," for which we identified approximately twenty pairs of letters. We then
consider the similarity for any possible combination of letters in English and romanized katakana words,
which can be represented as a matrix, as shown in Figure 6. This figure shows the similarity between letters
in "<text, te-ki-su-to>." We put a dummy letter "$," which has a positive similarity only to itself, at the
end of both English and katakana words.
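The construction of such a similarity matrix can be sketched as follows. Several details are assumptions made for this sketch and are labeled in the code: the list of phonetically similar pairs contains only the "L"/"R" pair named above (the full list of roughly twenty pairs is not reproduced), the similarity in the "otherwise" case is taken to be 0, and the names `similarity` and `similarity_matrix` are hypothetical.

```python
VOWELS = set("aeiou")

# Assumption: only the L/R pair is named in the text; the remaining pairs are omitted.
PHONETICALLY_SIMILAR = {frozenset("lr")}


def similarity(e, j):
    """Similarity between English letter e and the first romanized letter j of a katakana character."""
    e, j = e.lower(), j.lower()
    if "$" in (e, j):
        # The dummy letter has a positive similarity only to itself.
        return 3 if e == j else 0
    if e == j:
        return 3
    if frozenset((e, j)) in PHONETICALLY_SIMILAR:
        return 2
    if (e in VOWELS) == (j in VOWELS):
        return 1  # both vowels or both consonants
    return 0  # assumption: value used here for the "otherwise" case


def similarity_matrix(english, katakana):
    """Rows: English letters plus '$'. Columns: first letter of each katakana character plus '$'."""
    rows = list(english) + ["$"]
    columns = [character[0] for character in katakana.split("-")] + ["$"]
    return [[similarity(e, j) for j in columns] for e in rows]


matrix = similarity_matrix("text", "te-ki-su-to")
```

For "<text, te-ki-su-to>", the rows correspond to t, e, x, t, $ and the columns to t, k, s, t, $; a path search over this matrix would then select the symbol correspondences.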

One may notice that matching plausible symbols can be seen as finding the path which maximizes the
total similarity from the first to last letters. The best path can efficiently be found by, for example,
Dijkstra's algorithm (Dijkstra, 1959). From Figure 6, we can derive the following correspondences: "<te, te>,"
"<x, ki-su>" and "<t, to>." In practice, to exclude noisy correspondences, we used only English-Japanese
translations whose total similarity from the first to last letters is above a predefined threshold. The
resultant transliteration dictionary contains 432 Japanese and 1018 English symbols, from which we estimated
$P(s_i|t_i)$.

Table 1: The similarity between English letter e and Japanese letter j

Condition                                  Similarity
e and j are identical                      3
e and j are phonetically similar           2
both e and j are vowels or consonants      1
otherwise








Figure 6: An example matrix for English-Japanese symbol matching (arrows denote the best path)

To evaluate our transliteration method, we extracted Japanese katakana words (excluding compound
words) and their English translations from an English-Japanese dictionary (Nichigai Associates, 1996). We
then discarded Japanese/English pairs that were not phonetically equivalent to each other, as well as those
that were listed in the EDR dictionaries. For the resultant 248 pairs, the accuracy of our transliteration
method was 65.3%.

Thus, our transliteration method is less accurate than the word-based translation. For example, the
katakana word "re-ji-su-ta (register/resistor)" is transliterated into "resister," "resistor" and "register,"
with the probability score in descending order. Note that Japanese seldom represents "resister" as
"re-ji-su-ta" (whereas it can be theoretically correct when this word is written in katakana characters), because
"resister" corresponds to more appropriate translations in kanji characters. However, the compound word
translation is expected to select appropriate transliteration candidates. For example, "re-ji-su-ta" in the
compound word "re-ji-su-ta tensou gengo (register transfer language)" is successfully translated, given a set
of base words "tensou (transfer)" and "gengo (language)" as a context.

Finally, we devote a little more space to comparing our transliteration method with other related works.
Chen et al. (1998) proposed a Chinese-English transliteration method. Given a (romanized) source word,
their method computes the similarity between the source word and each target word listed in the dictionary.
In brief, the more letters two words share in common, the more similar they are. In other words, unlike
our case, their method disregards the order of letters in source and target words, which potentially degrades
the transliteration accuracy. In addition, since for each source word the similarity is computed against
all the target words (or words that share at least one common letter with the source word), the similarity
computation can be prohibitive. Lee and Choi (1997) explored English-Korean transliteration, where they
automatically produced a transliteration model from a word-aligned corpus. In brief, they first consider
all possible English-Korean symbol correspondences for each word alignment. Then, iterative estimation
is performed to select the symbol correspondences that maximize transliteration accuracy on training
data. However, when compared with our symbol alignment method, their iterative estimation method is
computationally expensive. Knight and Graehl (1995) proposed a Japanese-English transliteration method
based on the mapping probability between English and Japanese katakana sounds. However, while their
method needs a large-scale phoneme inventory, we use a simpler approach using surface mapping between
English and katakana characters, as defined in our transliteration dictionary. Note that none of the
above methods has been evaluated in the context of CLIR. Empirical comparison of different transliteration
methods needs to be further explored.

4.4. Further Enhancement of Translation 

This section explains two additional methods to enhance the query translation. 

First, we can enhance our base word dictionary with general words, because technical compound words
sometimes include general words, as discussed in Section 1. Note that in Section 4.2 we produced our
base word dictionary from the EDR technical terminology dictionary. Thus, we used the EDR bilingual
dictionary (Japan Electronic Dictionary Research Institute, 1995a), which consists of approximately 370,000
Japanese-English translations aimed at general usage. However, unlike in the case of technical terms, it is not
feasible to segment general compound words, such as "hot dog," into base words. Thus, we simply extracted
162,751 Japanese and 67,136 English single words (i.e., words that consist of a single base word) from this
dictionary. In addition, to minimize the degree of translation ambiguity, we use general translations only
when (a) base words unlisted in our technical term dictionary are found, and (b) our transliteration method
fails to output any candidates for those unlisted base words.

Second, in Section 1 we also identified that English technical terms are often abbreviated, such as "IR"
and "NLP," and they can be used as Japanese words. One solution would be to output those abbreviated 
words as they are, for both Japanese-English and English-Japanese translations. On the other hand, it is 
expected that we can improve the recall by using complete forms along with their abbreviated forms. To 
realize this notion, we extracted 7,307 tuples of each abbreviation and its complete form from the NACSIS 
English document collection, using simple heuristics. Our heuristics relies on the assumption that either 
abbreviations or complete forms often appear in parentheses headed by their counterparts, as shown below: 

Natural Language Processing (NLP), 
cross-language information retrieval (CLIR), 
MRDs (machine readable dictionaries). 

While the first example is the most straightforward, in the second and third examples we disregard a hyphen
and a lowercase letter (i.e., "s" in "MRDs"), respectively. In practice, we can easily extract such tuples using
regular expression pattern matching. Figure 7 shows example tuples of abbreviations and complete
forms extracted from the NACSIS collection. In this figure, the column "Frequency" denotes the frequency
with which each tuple appears in the collection, based on which we can optionally set a cut-off threshold for
multiple complete forms corresponding to a single abbreviation (e.g., "information retrieval," "isoprene rubber"
and "insulin receptor" for "IR").

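The parenthesis-based heuristic for abbreviations can be sketched with Python's `re` module. This is a minimal sketch under stated assumptions: the function names and the exact patterns are hypothetical, an abbreviation is assumed to be at least two uppercase letters optionally followed by "s," and a pair is accepted only when the initials of the complete form spell the abbreviation. Hyphens are treated as word separators, so the complete form is returned in a normalized, lowercased form.

```python
import re


def initials(words):
    """Upper-case initials of a list of words."""
    return "".join(word[0].upper() for word in words)


def extract_abbreviations(text):
    """Extract (abbreviation, complete form) pairs that appear next to each other in parentheses."""
    pairs = []
    # Case 1: complete form followed by "(ABBR)", e.g. "Natural Language Processing (NLP)".
    for match in re.finditer(r"\(([A-Z]{2,})s?\)", text):
        abbreviation = match.group(1)
        preceding = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = preceding[-len(abbreviation):]
        if initials(candidate) == abbreviation:
            pairs.append((abbreviation, " ".join(candidate).lower()))
    # Case 2: "ABBR (complete form)", e.g. "MRDs (machine readable dictionaries)".
    for match in re.finditer(r"\b([A-Z]{2,})s?\s*\(([^()]+)\)", text):
        abbreviation = match.group(1)
        candidate = re.findall(r"[A-Za-z]+", match.group(2))
        if initials(candidate) == abbreviation:
            pairs.append((abbreviation, " ".join(candidate).lower()))
    return pairs


sample = ("Natural Language Processing (NLP), "
          "cross-language information retrieval (CLIR), "
          "MRDs (machine readable dictionaries).")
pairs = extract_abbreviations(sample)
```

Counting how often each pair occurs (for example with `collections.Counter`) would give the frequency column used for the cut-off threshold.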
5. Evaluation 

5.1. Methodology 

Abbreviation    Complete form                  Frequency
IR              information retrieval          3
IR              isoprene rubber                1
IR              insulin receptor               1
MT              machine translation            11
MT              mobile telephone               3
NLP             natural language processing    8

Figure 7: Example abbreviations and their complete forms

We investigated the performance of our system in terms of Japanese-English CLIR, based on the TREC-type
evaluation methodology. That is, the system outputs 1,000 top documents, and the TREC evaluation
software was used to plot recall-precision curves and calculate non-interpolated average precision values.
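For reference, non-interpolated average precision for a single query can be computed as in the following sketch, which follows the usual definition (the mean of the precision values at the ranks of relevant documents, divided by the total number of relevant documents). The function name and the toy document IDs are illustrative, and the sketch assumes that the ranked list contains no duplicate IDs; it is not the TREC evaluation software itself.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision of one ranked list for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant)


# Toy example: relevant documents at ranks 1 and 3; a third relevant document is not retrieved.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"})
```

In the toy example the value is (1/1 + 2/3) / 3. The figures reported in Section 5.2 are averages of such values over the 21 queries.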



For the purpose of our evaluation, we used a preliminary version of the NACSIS test collection (Kando et
al., 1999). This collection includes approximately 330,000 documents (in either a combination of English and
Japanese or either of the languages individually), collected from technical papers published by 65 Japanese
associations for various fields.⁵ Each document consists of the document ID, title, name(s) of author(s),
name/date of conference, hosting organization, abstract and keywords, from which we used titles, abstracts
and keywords for the indexing. We used as target documents approximately 187,000 entries where abstracts
are in both English and Japanese.

This collection also includes 21 Japanese queries. Each query consists of the query ID, title of the topic,
description, narrative and list of synonyms, from which we used only the description.⁶ In general, most
topics are related to electronic, information and control engineering. Figure 8 shows example descriptions
(translated into English by one of the authors).



In the NACSIS collection, relevance assessment was performed based on the pooling method (Voorhees,
1995). That is, candidates for relevant documents were first obtained with multiple retrieval systems.
Thereafter, for each candidate document, human expert(s) assigned one of three ranks of relevance, i.e., "relevant,"
"partially relevant" and "irrelevant." The average number of candidate documents for each query is 4,400,
among which the average numbers of relevant and partially relevant documents are 144 and 13, respectively. In
our evaluation, we did not regard partially relevant documents as relevant ones, because (a) the result did
not significantly change depending on whether we regarded partially relevant as relevant or not, and (b)
interpretation of partially relevant is not fully clear to the authors.

Since the NACSIS collection does not contain English queries, we cannot estimate a baseline for
Japanese-English CLIR performance based on English-English IR. Instead, we used a Japanese-Japanese IR system,
which uses as documents Japanese titles/abstracts/keywords comparable to English fields in the NACSIS
collection. One may argue that we can manually translate Japanese queries into English. However, as
discussed in Section 2.3, the CLIR performance varies depending on the quality of translation, and thus we
avoided an arbitrary evaluation.

5 The official version of the NACSIS collection includes 39 Japanese queries and the same document set as in the
preliminary version we used. NACSIS (National Center for Science Information Systems, Japan) held a TREC-type (CL)IR
contest workshop in August 1999, and participants, including the authors of this paper, were provided with the whole
document set and 21 queries for training. These 21 queries are included in the final package of the test collection. See
http://www.rd.nacsis.ac.jp/~ntcadm/workshop/work-en.html for details.

6 In the NACSIS workshop, each participant can submit more than one retrieval result using different systems. However, at
least one result must be gained with only the description field.



ID      Description
0005    dimension reduction for clustering
0006    intelligent information retrieval using agent functions
0019    syntactic analysis methods for Japanese sentences
0024    machine translation systems

Figure 8: Example query descriptions in the NACSIS collection



5.2. Quantitative Comparison 

We compared the following query translation methods:

• all possible translations derived from the (original) EDR technical terminology dictionary (Japan
Electronic Dictionary Research Institute, 1995b) are used for query terms, which can be seen as a lower
bound method of this comparative experiment ("EDR"),

• all possible base word translations derived from our base word dictionary are used ("ALL"),

• k-best translations selected by our compound word translation method are used, where transliteration
is not used ("CWT"),

• transliteration is performed for unlisted katakana words in CWT above, which represents the overall
query translation method we proposed in this paper ("TRL").

One may notice that both EDR and ALL correspond to the dictionary-based method, and CWT and TRL
correspond to the hybrid method described in Section 2.1. In the case of EDR, compound words unlisted in
the EDR dictionary were manually segmented so that substrings (shorter compound words or base words)
could be translated. There was almost no translation ambiguity in the case of EDR. In addition, preliminary
experiments showed that disambiguation degraded the retrieval performance for EDR. In CWT and TRL,
k is a parametric constant, for which we set k = 1, because preliminary experiments showed that this setting
achieved the best performance. By increasing the value of k, we theoretically gain a query expansion effect,
because multiple semantically related translations are used as query terms. However, in our case, additional
translations were rather noisy with respect to the retrieval performance. Note that in this experiment, we
did not use the general and abbreviation dictionaries. We will discuss the effect of those dictionaries in
Section 5.4.

Table 2 shows the non-interpolated average precision values, averaged over the 21 queries, for different
combinations of query translation and retrieval methods. It is worth comparing the effectiveness of query
translation methods with different retrieval methods, because advanced retrieval methods potentially
overcome the rudimentary nature of query translation methods, and therefore may overshadow the difference of
query translation methods in CLIR performance. In consideration of this problem, as described in Section 3,
we adopted two alternative term weighting methods, i.e., the standard and logarithmic formulations. In
addition, we used as the IR engine in Figure 2 the SMART system (Salton, 1971), where the augmented
TFIDF term weighting method ("ATC") was used for both queries and documents. This makes it easy for
other researchers to rigorously compare their query translation methods with ours within the same evaluation
environment, because the SMART system is available to the public.

In Table 2, J-J refers to the baseline performance, that is, the result obtained by the Japanese-Japanese
IR system. Note that the performance of J-J using the SMART system is not available because this system
is not implemented for the retrieval of Japanese documents. The column "# of Terms" denotes the average
number of query terms used for the retrieval, where the number of terms used in ALL was approximately
seven times as great as those of other methods. The following observations can be derived from these results.



Table 2: Non-interpolated average precision values, averaged over the 21 queries, for different combinations
of query translation and retrieval methods

                        Retrieval Method
       # of Terms    Standard TF    Logarithmic TF    SMART
J-J    4.0           0.2085         0.2443            -
TRL    4.0           0.2427         0.2911            0.3147
CWT    3.9           0.2324         0.2680            0.2770
ALL    21            0.1971         0.2271            0.2106
EDR    4.1           0.1785         0.2173            0.2477



First, the relative superiority between EDR and ALL varies depending on the retrieval method. Since 
neither case resolved the translation ambiguity, the difference in performance for the two translation methods 
is reduced solely to the difference between the two dictionaries. Therefore, the base word dictionary we 
produced was effective when combined with the standard and logarithmic TF formulations. However, the 
translation disambiguation as performed in CWT improved the performance of ALL, and consequently CWT 
outperformed EDR irrespective of the retrieval method. To sum up, our compound word translation method 
was more effective than the use of an existing dictionary, in terms of CLIR performance. 

Second, by comparing results of CWT and TRL, one can see that our transliteration method further
improved the performance of the compound word translation relying solely on the base word dictionary,
irrespective of the retrieval method. Since TRL represents the overall performance of our system, it is
worth comparing TRL and EDR (i.e., a lower bound method) more carefully. Thus, we used the paired
t-test for statistical testing, which investigates whether the difference in performance is meaningful or simply
due to chance (Hull, 1993; Keen, 1992). We found that the average precision values of TRL and EDR are
significantly different (at the 5% level), for any of the three retrieval methods.
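The paired t-test compares two methods through the per-query differences of their average precision values. The sketch below computes the test statistic with the Python standard library only; the function name and the per-query values are hypothetical and are not taken from our experiments.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return the paired t statistic and its degrees of freedom for two matched score lists."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(differences)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    t = mean(differences) / (stdev(differences) / math.sqrt(n))
    return t, n - 1


# Hypothetical per-query average precision values for two methods on three queries.
method_a = [0.4, 0.5, 0.6]
method_b = [0.3, 0.3, 0.3]
t_value, degrees_of_freedom = paired_t_statistic(method_a, method_b)
```

With 21 queries there are 20 degrees of freedom, for which the two-sided critical value at the 5% level is approximately 2.086; a statistic whose absolute value exceeds it indicates a significant difference.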

Third, the performance generally improved as a more sophisticated retrieval method was used, for
all of the translation methods except ALL. In other words, enhancements of the query translation and IR
engine independently improved the performance of our CLIR system. Note that the difference between
the SMART system and the other two methods is due to more than one factor, including stemming and term
weighting methods. This suggests that our system may achieve a higher performance using other advanced
IR techniques.

Finally, TRL and CWT outperformed J-J for any of the retrieval methods. However, these differences 
are partially attributed to the different properties inherent in Japanese and English IR. For example, the 
performance of Japanese IR is more strongly dependent on the indexing method than English IR, since 
Japanese lacks lexical segmentation. This issue needs to be further explored. 

Figures 9-11 show recall-precision curves of different query translation methods, for different retrieval
methods, respectively. In these figures, while the superiority of EDR and ALL in terms of precision varies
depending on the recall, one can see that CWT outperformed EDR and ALL, and that TRL outperformed
CWT, regardless of the recall. In Figures 9 and 10, J-J generally performed better at lower recall while any of
the four CLIR methods performed better at higher recall. As discussed above, possible rationales would include the
difference between Japanese and English IR. To put it more precisely, in Japanese IR a word-based indexing
method (as performed in our IR engine) fails to retrieve documents in which words are inappropriately
segmented. In addition, the ChaSen morphological analyzer often incorrectly segments katakana words,
which frequently appear in technical documents. Consequently this drawback leads to a poor recall in the
case of J-J.

Figure 9: Recall-precision curves using the standard TF

Figure 10: Recall-precision curves using the logarithmic TF

Figure 11: Recall-precision curves using the SMART system

5.3. Query-by-query Analysis

In this section, we discuss reasons why our translation method was effective in CLIR performance,
through a query-by-query analysis.

First, we compared EDR and CWT (see Section 5.2), to investigate the effectiveness of our compound
word translation method. For this purpose, we identified fragments of the NACSIS queries that were correctly
translated by CWT but not by EDR, as shown in Table 3. In this table, where we insert hyphens between
each Japanese base word for enhanced readability, Japanese/English words unlisted in the EDR technical
terminology dictionary are underlined. Note that as mentioned in Section 5.2, in these cases translations
for remaining base words were used as query terms. However, in the case of the query 0019, the EDR
dictionary lists a phrase translation, i.e., "kakariuke-kaiseki (analysis of dependence relation)," and thus
"analysis," "dependence" and "relation" were used as query terms ("of" was discarded as a stopword). One
can see that CWT outperformed EDR in all but the five asterisked cases, out of 18 cases. Note that in the
case of 0019, EDR conducted a phrase-based translation, while CWT conducted a word-based translation.
The relative superiority between these two translation approaches varies depending on the retrieval method,
and thus we cannot draw any conclusion regarding this point in this paper. In the case of the query 0006,
although the translation in CWT was linguistically correct, we found that the English word "agent function"
is rarely used in documents associated with agent research, and that "function" ended up degrading the
retrieval performance. In the case of the query 0020, "loanword" would be a more appropriate translation
for "gairaigo." However, even when we used "loanword" for the retrieval, instead of "foreign" and "word,"
the performance of EDR did not change.



Table 3: Query-by-query comparison between EDR and CWT

                                                              Change in Average Precision (EDR -> CWT)
ID    Japanese (Translation in CWT)                           Standard TF        Logarithmic TF     SMART
0001  jiritsu-idou-robotto (autonomous mobile robot)          0.2325 -> 0.3667   0.2587 -> 0.4058   0.2259 -> 0.3441
0004  bunsho-gazou-rikai (document image understanding)       0.0011 -> 0.2775   0.0091 -> 0.3768   0.0217 -> 0.2740
0006  eejento-kinou (agent function)                          0.2008 -> 0.1603*  0.2920 -> 0.1997*  0.1430 -> 0.1395*
0016  saidai-kyoutsuu-bubungurafu (greatest common subgraph)  0.1615 -> 0.5039   0.4661 -> 0.6216   0.1295 -> 0.4460
0019  kakariuke-kaiseki (dependency analysis)                 0.0794 -> 0.3550   0.1383 -> 0.4302   0.1852 -> 0.1449*
0020  katakana-gairai-go (katakana foreign word)              0.4536 -> 0.4568   0.2408 -> 0.4674   0.9429 -> 0.8769*
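
The count of cases in which CWT outperformed EDR can be checked directly from the values in Table 3. The
following Python sketch is only an illustrative aid: the names `TABLE3` and `tally` are assumptions introduced
for this example and are not part of the evaluated system.

```python
# Average precision pairs (EDR, CWT) transcribed from Table 3,
# in the order: standard TF, logarithmic TF, SMART.
TABLE3 = {
    "0001": [(0.2325, 0.3667), (0.2587, 0.4058), (0.2259, 0.3441)],
    "0004": [(0.0011, 0.2775), (0.0091, 0.3768), (0.0217, 0.2740)],
    "0006": [(0.2008, 0.1603), (0.2920, 0.1997), (0.1430, 0.1395)],
    "0016": [(0.1615, 0.5039), (0.4661, 0.6216), (0.1295, 0.4460)],
    "0019": [(0.0794, 0.3550), (0.1383, 0.4302), (0.1852, 0.1449)],
    "0020": [(0.4536, 0.4568), (0.2408, 0.4674), (0.9429, 0.8769)],
}


def tally(table):
    """Count the cases where CWT improved on, or fell below, EDR."""
    pairs = [pair for row in table.values() for pair in row]
    improved = sum(1 for edr, cwt in pairs if cwt > edr)
    degraded = sum(1 for edr, cwt in pairs if cwt < edr)
    return improved, degraded


improved, degraded = tally(TABLE3)
print(improved, degraded)  # 13 5
```

The five degraded cases are exactly the asterisked entries of the table.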



Second, we compared CWT and TRL in Table 4, which uses the same basic notation as Table 3. The
NACSIS query set contains 20 katakana base word types, among which "ma-i-ni-n-gu (mining)" and
"ko-ro-ke-i-sho-n (collocation)" were unlisted in our base word dictionary. Unlike the previous case,
transliteration generally improved the performance. On the other hand, we concede that three queries are not
enough to justify the effectiveness of our transliteration method. In view of this problem, we assumed that
every katakana word in the queries was unlisted in our base word dictionary, and compared the following two
extreme cases:

• every katakana word was untranslated (i.e., they were simply discarded from queries), which can be
seen as a lower bound method in this comparison,

• transliteration was applied to every katakana word, instead of consulting the base word dictionary.



Both cases were combined with the CWT method of Section 5.2. Note that in the latter case, when a katakana
word is included in a compound word, the transliteration candidates for the word are disambiguated through
the compound word translation method, and thus noisy candidates are potentially discarded. It should also be
noted that when a compound word consists solely of katakana words (e.g., deeta-mainingu (data
mining)), our method automatically segments it into base words by transliterating all possible substrings.
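
As a rough illustration of how transliterating all possible substrings can segment an all-katakana compound,
the following Python sketch enumerates every split whose parts can be transliterated. The romanized input,
the toy `TRANSLIT` table (including the hypothetical entry for "mai") and the function name are assumptions
made for this example. In particular, a table lookup merely stands in for the character-based transliteration
method, and no disambiguation is performed.

```python
# Toy stand-in for transliteration: maps a romanized katakana substring to
# candidate English words. The entries are hypothetical.
TRANSLIT = {
    "deeta": ["data"],
    "mainingu": ["mining"],
    "mai": ["my"],
}


def segmentations(s, translit):
    """Return every split of s into substrings that can be transliterated.

    Each result is a list of (substring, candidates) pairs.
    """
    if not s:
        return [[]]  # the empty string has exactly one (empty) segmentation
    results = []
    for end in range(1, len(s) + 1):
        head = s[:end]
        if head in translit:
            # Keep this head only if the remainder can also be segmented.
            for rest in segmentations(s[end:], translit):
                results.append([(head, translit[head])] + rest)
    return results


for parts in segmentations("deetamainingu", TRANSLIT):
    print(parts)
```

In this toy setting the split "deeta" + "mai" + "ningu" is rejected because "ningu" has no transliteration,
so only "deeta" + "mainingu" survives.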

Table 5 shows the average precision values, averaged over the 21 queries, for the two cases above. By
comparing Table 5 with the results in Section 5.2, one can see that the performance was considerably degraded
when we disregarded every katakana word, and that even when we applied transliteration to every katakana
word, the performance was greater than that of CWT and was quite comparable to that of TRL. Among the 20
katakana base words, only "eejento (agent)" was incorrectly transliterated, into "eagent," which was due to
the insufficient size of the transliteration dictionary.



Table 4: Query-by-query comparison between CWT and TRL

                                      Change in Average Precision (CWT -> TRL)
ID    Japanese (Translation in TRL)   Standard TF        Logarithmic TF     SMART
0008  deeta-mainingu (data mining)    0.0018 -> 0.0942   0.0299 -> 0.3363   0.3156 -> 0.7295
0012  deeta-mainingu (data mining)    0.0018 -> 0.1229   0.0003 -> 0.1683   0.0000 -> 0.0853
0015  korokeishon (collocation)       0.0054 -> 0.0084   0.0389 -> 0.0485   0.0193 -> 0.3114



Table 5: Non-interpolated average precision values, averaged over the 21 queries, for the evaluation of
transliteration

                                                 Retrieval Method
                                   # of Terms    Standard TF   Logarithmic TF   SMART
discard every katakana word        2.8           0.1519        0.1840           0.1873
transliterate every katakana word  4.0           0.2354        0.2786           0.3024
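
For readers who wish to compute values of the kind reported in Table 5, the following Python sketch gives
non-interpolated average precision under its commonly used definition: precision is measured at the rank of
each relevant document retrieved, and the sum is divided by the total number of relevant documents. It is
assumed here, not confirmed, that this matches the exact evaluation procedure behind the table. The function
names and the toy ranking are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision of one ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Average the per-query values; runs is a list of (ranking, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranking = ["d3", "d1", "d7", "d2"]   # toy system output, best first
relevant = {"d1", "d2", "d9"}        # toy relevance judgments
print(round(average_precision(ranking, relevant), 4))  # 0.3333
```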



Finally, we discuss the effect of the additional dictionaries, i.e., the general and abbreviation dictionaries.
The NACSIS query set contains the general word "shimbun-kiji (newspaper article)" and the abbreviation
"LFG (lexical functional grammar)," both unlisted in our technical base word dictionary. The abbreviation
dictionary lists the correct translation for "LFG." On the other hand, our general dictionary, which consists
solely of single words, does not list the correct translation for "shimbun-kiji." Instead, the English word
"story" was listed as the translation, which would be used only in a particular context. Table 6, which uses
the same basic notation as Table 3, compares average precision values with and without these translations.
From this table we cannot see any improvement with the additional dictionaries. However, when the correct
translation "newspaper article" was provided for query 0023, the performance improved regardless of the
retrieval method. In addition, since we found only two cases where the additional dictionaries could be
applied, this issue needs to be further explored using more test queries.

Table 6: Query-by-query comparison for the general and abbreviation dictionaries

                                         Change in Average Precision
ID    Japanese (Translation)             Standard TF        Logarithmic TF     SMART
0023  shimbun-kiji (story)               0.0003 -> 0.0000*  0.0000 -> 0.0000   0.0000 -> 0.0000
0023  shimbun-kiji (newspaper article)   0.0003 -> 0.0200   0.0000 -> 0.0858   0.0000 -> 0.1800
0025  LFG (lexical functional grammar)   0.8000 -> 0.5410*  0.8000 -> 0.6879*  0.9452 -> 0.8617*
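
The role of the additional dictionaries as a fallback for words unlisted in the technical base word
dictionary can be pictured with the following Python sketch. The toy entries, the function name `look_up`,
and the order in which the general and abbreviation dictionaries are consulted are all assumptions made for
illustration.

```python
# Toy dictionaries; every entry is illustrative only.
BASE_WORD_DICT = {"kaiseki": ["analysis"]}
GENERAL_DICT = {"kiji": ["story"]}
ABBREVIATION_DICT = {"LFG": ["lexical functional grammar"]}


def look_up(word):
    """Return (source, translations) from the first dictionary listing word."""
    cascade = (
        ("base", BASE_WORD_DICT),
        ("general", GENERAL_DICT),            # consulted only if base fails
        ("abbreviation", ABBREVIATION_DICT),  # assumed order of fallback
    )
    for source, dictionary in cascade:
        if word in dictionary:
            return source, dictionary[word]
    return None, []  # unlisted everywhere: no translation is produced here


for word in ["kaiseki", "kiji", "LFG", "mainingu"]:
    print(word, look_up(word))
```

As the "story" example shows, a fallback dictionary helps only when the translation it lists suits the
technical context of the query.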



6. Conclusion 

Reflecting the rapid growth in utilization of machine readable multilingual texts in the 1990s, cross- 
language information retrieval (CLIR), which was initiated in the 1960s, has variously been explored in 
order to facilitate retrieving information across languages. For this purpose, a number of CLIR systems have 
been developed in information retrieval, natural language processing and artificial intelligence research. 

In this paper, we proposed a Japanese/English bidirectional CLIR system targeting technical documents,
for which the translation of technical terms is a crucial task. To place our methodology in the context of
past research, we surveyed existing CLIR systems, and classified them into three
approaches: (a) translating queries into the document language, (b) translating documents into the query
language, and (c) representing both queries and documents in a language-independent space. Among these
approaches, we found that the first one, namely the query translation approach, is relatively inexpensive to
implement. Therefore, following this approach, we combined query translation and monolingual retrieval
modules.

However, a naive query translation method relying on existing bilingual dictionaries does not guarantee
sufficient system performance, because new technical terms are progressively created by combining existing
base words or by transcribing foreign words in Japanese katakana phonograms. To counter this problem, we
proposed compound word translation and transliteration methods, and integrated them within one framework.
Our methods involve dictionary production and probabilistic resolution of translation/transliteration
ambiguity, both of which are fully automated. To produce the dictionary used for compound word translation,
we extracted base word translations from the EDR technical terminology dictionary. To produce the
transliteration dictionary, on the other hand, we aligned English and Japanese katakana words on a character
basis. For the disambiguation, we used word frequency statistics extracted from the document collection. We
also produced a dictionary for abbreviated English technical terms, to enhance the translation.
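
The idea of resolving translation ambiguity with statistics from the document collection can be sketched in
Python as follows. This is a simplified stand-in and not the probabilistic model defined earlier in the paper:
candidates are scored only by smoothed counts of adjacent word pairs, the search is exhaustive, and the
candidate lists, counts, smoothing constant and function names are all hypothetical.

```python
from itertools import product

# Hypothetical candidate translations for the base words of one compound.
CANDIDATES = [
    ["document", "sentence"],
    ["image", "picture"],
    ["understanding", "comprehension"],
]

# Hypothetical adjacent-pair counts from a target-language collection.
BIGRAM_COUNTS = {
    ("document", "image"): 40,
    ("image", "understanding"): 25,
    ("sentence", "picture"): 1,
}


def score(words, counts, smoothing=0.1):
    """Product of smoothed counts over adjacent word pairs."""
    total = 1.0
    for pair in zip(words, words[1:]):
        total *= counts.get(pair, 0) + smoothing
    return total


def best_translation(candidates, counts):
    """Exhaustively pick the highest-scoring combination of candidates."""
    return max(product(*candidates), key=lambda words: score(words, counts))


print(" ".join(best_translation(CANDIDATES, BIGRAM_COUNTS)))
# document image understanding
```

Exhaustive enumeration grows exponentially with the number of base words, so a practical system would
need a more efficient search.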

From a scientific point of view, we investigated the performance of our CLIR system by way of a
standardized IR evaluation method. For this purpose, we used the NACSIS test collection, which consists of
Japanese queries and Japanese/English technical abstracts, and carried out a Japanese-English CLIR
evaluation. Our evaluation results showed that each individual method proposed, i.e., compound word
translation and transliteration, improved on the baseline performance, and that when used together the
improvement was even greater, resulting in a performance comparable with Japanese-Japanese monolingual IR.
We also showed that the enhancement of the retrieval module improved our system performance, independently
of the enhancement of the query translation module.

Future work will include improvement of each component in our system, and the effective presentation
of retrieved documents using sophisticated summarization techniques.



Acknowledgments 

The authors would like to thank Noriko Kando (National Institute of Informatics, Japan) for her support 
with the NACSIS collection. 

References 

[AAAI1997] AAAI. 1997. Electronic Working Notes of the AAAI Spring Symposium on Cross-Language Text
and Speech Retrieval, http://www.clis.umd.edu/dlrg/filter/sss/papers/.

[ACM1996-1998] ACM SIGIR. 1996-1998. Proceedings of the Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval.

[Aone et al.1997] Chinatsu Aone, Nicholas Charocopos, and James Gorlinsky. 1997. An intelligent
multilingual information browsing and retrieval system using information extraction. In Proceedings of
the 5th Conference on Applied Natural Language Processing, pages 332-339.

[Ballesteros and Croft1997] Lisa Ballesteros and W. Bruce Croft. 1997. Phrasal translation and query
expansion techniques for cross-language information retrieval. In Proceedings of the 20th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 84-91.

[Ballesteros and Croft1998] Lisa Ballesteros and W. Bruce Croft. 1998. Resolving ambiguity for
cross-language retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, pages 64-71.

[Brown et al.1993] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L.
Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational
Linguistics, 19(2):263-311.

[Carbonell et al.1997] Jaime G. Carbonell, Yiming Yang, Robert E. Frederking, Ralf D. Brown, Yibing Geng,
and Danny Lee. 1997. Translingual information retrieval: A comparative evaluation. In Proceedings of
the 15th International Joint Conference on Artificial Intelligence, pages 708-714.

[Chen et al.1998] Hsin-Hsi Chen, Sheng-Jie Huang, Yung-Wei Ding, and Shih-Chung Tsai. 1998. Proper
name translation in cross-language information retrieval. In Proceedings of the 36th Annual Meeting of
the Association for Computational Linguistics and the 17th International Conference on Computational
Linguistics, pages 232-236.

[Chen et al.1999] Hsin-Hsi Chen, Guo-Wei Bian, and Wen-Cheng Lin. 1999. Resolving translation ambiguity
and target polysemy in cross-language information retrieval. In Proceedings of the 37th Annual Meeting
of the Association for Computational Linguistics, pages 215-222.

[Church and Mercer1993] Kenneth W. Church and Robert L. Mercer. 1993. Introduction to the special issue
on computational linguistics using large corpora. Computational Linguistics, 19(1):1-24.

[Davis and Ogden1997] Mark W. Davis and William C. Ogden. 1997. QUILT: Implementing a large-scale
cross-language text retrieval system. In Proceedings of the 20th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, pages 92-98.

[Deerwester et al.1990] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and
Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for
Information Science, 41(6):391-407.

[Dijkstra1959] Edsger W. Dijkstra. 1959. A note on two problems in connexion with graphs. Numerische
Mathematik, 1:269-271.

[Dorr and Oard1998] Bonnie J. Dorr and Douglas W. Oard. 1998. Evaluating resources for query translation
in cross-language information retrieval. In Proceedings of the 1st International Conference on Language
Resources and Evaluation, pages 759-764.

[Dumais et al.1996] Susan T. Dumais, Thomas K. Landauer, and Michael L. Littman. 1996. Automatic
cross-linguistic information retrieval using latent semantic indexing. In ACM SIGIR Workshop on
Cross-Linguistic Information Retrieval.

[Fellbaum1998] Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

[Ferber1989] Gene Ferber. 1989. English-Japanese, Japanese-English Dictionary of Computer and
Data-Processing Terms. MIT Press.

[Fung et al.1999] Pascale Fung, Liu Xiaohu, and Cheung Chi Shun. 1999. Mixed language query
disambiguation. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics,
pages 333-340.

[Fung1995] Pascale Fung. 1995. A pattern matching method for finding noun and proper noun translations
from noisy parallel corpora. In Proceedings of the 33rd Annual Meeting of the Association for
Computational Linguistics, pages 236-243.

[Gachot et al.1996] Denis A. Gachot, Elke Lange, and Jin Yang. 1996. The SYSTRAN NLP browser: An
application of machine translation technology in multilingual information retrieval. In ACM SIGIR
Workshop on Cross-Linguistic Information Retrieval.

[Gilarranz et al.1997] Julio Gilarranz, Julio Gonzalo, and Felisa Verdejo. 1997. An approach to conceptual
text retrieval using the EuroWordNet multilingual semantic database. In Electronic Working Notes of
the AAAI Spring Symposium on Cross-Language Text and Speech Retrieval.

[Gonzalo et al.1998] Julio Gonzalo, Felisa Verdejo, Carol Peters, and Nicoletta Calzolari. 1998. Applying
EuroWordNet to cross-language text retrieval. Computers and the Humanities, 32:185-207.

[Hull and Grefenstette1996] David A. Hull and Gregory Grefenstette. 1996. Querying across languages:
A dictionary-based approach to multilingual information retrieval. In Proceedings of the 19th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, pages
49-57.

[Hull1993] David Hull. 1993. Using statistical testing in the evaluation of retrieval experiments. In
Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 329-338.

[Hull1997] David A. Hull. 1997. Using structured queries for disambiguation in cross-language information
retrieval. In Electronic Working Notes of the AAAI Spring Symposium on Cross-Language Text and
Speech Retrieval.

[Japan Electronic Dictionary Research Institute1995a] Japan Electronic Dictionary Research Institute.
1995a. Bilingual dictionary. (In Japanese).

[Japan Electronic Dictionary Research Institute1995b] Japan Electronic Dictionary Research Institute.
1995b. Technical terminology dictionary (information processing). (In Japanese).

[Kaji and Aizono1996] Hiroyuki Kaji and Toshiko Aizono. 1996. Extracting word correspondences from
bilingual corpora based on word co-occurrence information. In Proceedings of the 16th International
Conference on Computational Linguistics, pages 23-28.

[Kando et al.1999] Noriko Kando, Kazuko Kuriyama, and Toshihiko Nozue. 1999. NACSIS test collection
workshop (NTCIR-1). In Proceedings of the 22nd Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 299-300.

[Keen1992] E. Michael Keen. 1992. Presenting results of experimental retrieval comparisons. Information
Processing & Management, 28(4):491-502.

[Knight and Graehl1998] Kevin Knight and Jonathan Graehl. 1998. Machine transliteration. Computational
Linguistics, 24(4):599-612.

[Kobayashi et al.1994] Yoshiyuki Kobayashi, Takenobu Tokunaga, and Hozumi Tanaka. 1994. Analysis of
Japanese compound nouns using collocational information. In Proceedings of the 15th International
Conference on Computational Linguistics, pages 865-869.

[Kwon et al.1998] Oh-Woog Kwon, Insu Kang, Jong-Hyeok Lee, and Geunbae Lee. 1998. Conceptual
cross-language text retrieval based on document translation using Japanese-to-Korean MT system.
International Journal of Computer Processing of Oriental Languages, 12(1):1-16.

[Lee and Choi1997] Jae Sung Lee and Key-Sun Choi. 1997. A statistical method to generate various foreign
word transliterations in multilingual information retrieval system. In Proceedings of the 2nd International
Workshop on Information Retrieval with Asian Languages, pages 123-128.

[Mani and Bloedorn1998] Inderjeet Mani and Eric Bloedorn. 1998. Machine learning of generic and
user-focused summarization. In Proceedings of AAAI/IAAI-98, pages 821-826.

[Matsumoto et al.1997] Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Osamu Imaichi, and Tomoaki
Imamura. 1997. Japanese morphological analysis system ChaSen manual. Technical Report
NAIST-IS-TR97007, NAIST. (In Japanese).

[McCarley1999] J. Scott McCarley. 1999. Should we translate the documents or the queries in cross-language
information retrieval? In Proceedings of the 37th Annual Meeting of the Association for Computational
Linguistics, pages 208-214.

[Mongar1969] P.E. Mongar. 1969. International co-operation in abstracting services for road engineering.
The Information Scientist, 3:51-62.

[Nichigai Associates1996] Nichigai Associates. 1996. English-Japanese computer terminology dictionary. (In
Japanese).

[Nie et al.1999] Jian-Yun Nie, Michel Simard, Pierre Isabelle, and Richard Durand. 1999. Cross-language
information retrieval based on parallel texts and automatic mining of parallel texts from the Web. In
Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 74-81.

[NIST1992-1998] National Institute of Standards & Technology. 1992-1998. Proceedings of the Text REtrieval
Conferences, http://trec.nist.gov/pubs.html

[Oard and Resnik1999] Douglas W. Oard and Philip Resnik. 1999. Support for interactive document selection
in cross-language information retrieval. Information Processing & Management, 35(3):363-379.

[Oard1998] Douglas W. Oard. 1998. A comparative study of query and document translation for
cross-language information retrieval. In Proceedings of the 3rd Conference of the Association for Machine
Translation in the Americas, pages 472-483.

[Okumura et al.1998] Akitoshi Okumura, Kai Ishikawa, and Kenji Satoh. 1998. Translingual information
retrieval by a bilingual dictionary and comparable corpus. In The 1st International Conference on
Language Resources and Evaluation, Workshop on Translingual Information Management: Current Levels
and Future Abilities.

[Pirkola1998] Ari Pirkola. 1998. The effects of query structure and dictionary setups in dictionary-based
cross-language information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval, pages 55-63.

[Sakai et al.1999] Tetsuya Sakai, Masahiro Kajiura, Kazuo Sumita, Gareth Jones, and Nigel Collier. 1999. A
study on English-Japanese/Japanese-English cross-language information retrieval using machine
translation. Transactions of Information Processing Society of Japan, 40(11):4075-4086. (In Japanese).

[Salton and Buckley1988] Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in
automatic text retrieval. Information Processing & Management, 24(5):513-523.

[Salton and McGill1983] Gerard Salton and Michael J. McGill. 1983. Introduction to Modern Information
Retrieval. McGraw-Hill.

[Salton1970] Gerard Salton. 1970. Automatic processing of foreign language documents. Journal of the
American Society for Information Science, 21(3):187-194.

[Salton1971] Gerard Salton. 1971. The SMART Retrieval System: Experiments in Automatic Document
Processing. Prentice-Hall.

[Salton1972] Gerard Salton. 1972. Experiments in multi-lingual information retrieval. Technical Report TR
72-154, Computer Science Department, Cornell University.

[Schauble and Sheridan1997] Peter Schauble and Paraic Sheridan. 1997. Cross-language information retrieval
(CLIR) track overview. In The 6th Text Retrieval Conference.

[Sheridan and Ballerini1996] Paraic Sheridan and Jean Paul Ballerini. 1996. Experiments in multilingual
information retrieval using the SPIDER system. In Proceedings of the 19th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval, pages 58-65.

[Smadja et al.1996] Frank Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou. 1996. Translating
collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1):1-38.

[Suzuki et al.1998] Masami Suzuki, Naomi Inoue, and Kazuo Hashimoto. 1998. Effect on displaying
translated major keywords of contents as browsing support in cross-language information retrieval.
Information Processing Society of Japan SIGNL Notes, 98(63):99-106. (In Japanese).

[Suzuki et al.1999] Masami Suzuki, Naomi Inoue, and Kazuo Hashimoto. 1999. Effects of partial translation
for users' document selection in cross-language information retrieval. In Proceedings of The 5th Annual
Meeting of The Association for Natural Language Processing, pages 371-374. (In Japanese).

[Tombros and Sanderson1998] Anastasios Tombros and Mark Sanderson. 1998. Advantages of query
biased summaries in information retrieval. In Proceedings of the 21st Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, pages 2-10.

[Tsuji and Kageura1997] Keita Tsuji and Kyo Kageura. 1997. An HMM-based method for segmenting
Japanese terms and keywords based on domain-specific bilingual corpora. In Proceedings of the 4th
Natural Language Processing Pacific Rim Symposium, pages 557-560.

[Voorhees1998] Ellen M. Voorhees. 1998. Variations in relevance judgments and the measurement of retrieval
effectiveness. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 315-323.

[Vossen1998] Piek Vossen. 1998. Introduction to EuroWordNet. Computers and the Humanities, 32:73-89.

[Wong et al.1985] S.K.M. Wong, W. Ziarko, and P.C.N. Wong. 1985. Generalized vector space model in
information retrieval. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, pages 18-25.

[Xu and Croft1996] Jinxi Xu and W. Bruce Croft. 1996. Query expansion using local and global document
analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 4-11.

[Yamabana et al.1996] Kiyoshi Yamabana, Kazunori Muraki, Shinichi Doi, and Shin'ichiro Kamei. 1996. A
language conversion front-end for cross-linguistic information retrieval. In ACM SIGIR Workshop on
Cross-Linguistic Information Retrieval.

[Zobel and Moffat1998] Justin Zobel and Alistair Moffat. 1998. Exploring the similarity space. ACM SIGIR
FORUM, 32(1):18-34.



